Guided Policy Search as Approximate Mirror Descent
Authors
Abstract
Guided policy search algorithms can be used to optimize complex nonlinear policies, such as deep neural networks, without directly computing policy gradients in the high-dimensional parameter space. Instead, these methods use supervised learning to train the policy to mimic a “teacher” algorithm, such as a trajectory optimizer or a trajectory-centric reinforcement learning method. Guided policy search methods provide asymptotic local convergence guarantees by construction, but it is not clear how much the policy improves within a small, finite number of iterations. We show that guided policy search algorithms can be interpreted as an approximate variant of mirror descent, where the projection onto the constraint manifold is not exact. We derive a new guided policy search algorithm that is simpler and provides appealing improvement and convergence guarantees in simplified convex and linear settings, and show that in the more general nonlinear setting, the error in the projection step can be bounded. We provide empirical results on several simulated robotic navigation and manipulation tasks that show that our method is stable and achieves similar or better performance when compared to prior guided policy search methods, with a simpler formulation and fewer hyperparameters.
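The abstract's central claim is that guided policy search alternates two steps, as mirror descent does. The first is an improvement step performed by the "teacher", and the second is a projection onto the policy class carried out by supervised learning. The following is a toy sketch of that alternation on a single discrete action distribution. It rests on several assumptions: a KL (entropic) geometry, a fixed step-size parameter `eta` in place of an explicit KL-constraint step, and a hypothetical policy class of distributions that are uniform within fixed groups of actions, chosen so that the projection has a closed form. The names `c_step`, `s_step`, `kl`, `groups`, and the cost values are illustrative only. This is not the paper's algorithm, which operates on trajectory distributions and nonlinear policies.

```python
import numpy as np


def c_step(pi, cost, eta):
    """Mirror-descent step in the KL geometry on the probability simplex:
    argmin_p <p, cost> + eta * KL(p || pi), whose closed form is
    p proportional to pi * exp(-cost / eta)."""
    logits = np.log(pi) - cost / eta
    logits -= logits.max()  # subtract the max before exponentiating, for numerical stability
    p = np.exp(logits)
    return p / p.sum()


def s_step(p, groups):
    """KL projection onto a hypothetical restricted 'policy class': distributions that
    are uniform within each group of actions. argmin_pi KL(p || pi) over this class
    spreads each group's total mass p[g].sum() evenly over the group's members."""
    pi = np.empty_like(p)
    for g in groups:
        pi[g] = p[g].sum() / len(g)
    return pi


def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))


# Toy problem (illustrative values): 4 actions with fixed costs; the hypothetical
# policy class ties actions {0, 1} together and actions {2, 3} together.
cost = np.array([1.0, 0.5, 0.2, 0.9])
groups = [np.array([0, 1]), np.array([2, 3])]
pi = np.full(4, 0.25)  # start from the uniform policy, which lies in the class
for _ in range(50):
    p = c_step(pi, cost, eta=1.0)  # improvement step (the "teacher")
    pi = s_step(p, groups)         # projection back onto the policy class ("supervised learning")

print(pi.round(4), kl(p, pi))
```

Here the projection is exact. The abstract's point is that in guided policy search this step is carried out only approximately by supervised learning. In the sketch, `kl(p, pi)` gives one way to see how far each teacher step lands from the policy class. With these costs, the loop moves nearly all of the mass onto actions 2 and 3, the group with the lower average cost.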
Similar resources
Guided Policy Search via Approximate Mirror Descent
Guided policy search algorithms can be used to optimize complex nonlinear policies, such as deep neural networks, without directly computing policy gradients in the high-dimensional parameter space. Instead, these methods use supervised learning to train the policy to mimic a “teacher” algorithm, such as a trajectory optimizer or a trajectory-centric reinforcement learning method. Guided policy...
A Free Line Search Steepest Descent Method for Solving Unconstrained Optimization Problems
In this paper, we solve unconstrained optimization problems using a free line search steepest descent method. First, we propose a double parameter scaled quasi Newton formula for calculating an approximation of the Hessian matrix. The approximation obtained from this formula is a positive definite matrix that satisfies the standard secant relation. We also show that the largest eigenvalue...
Mirror Descent Search and Acceleration
In recent years, attention has been focused on the relationship between black box optimization and reinforcement learning. Black box optimization is a framework for the problem of finding the input that optimizes the output represented by an unknown function. Reinforcement learning, by contrast, is a framework for finding a policy to optimize the expected cumulative reward from trial and error....
Projected Natural Actor-Critic
Natural actor-critics form a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability—their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of re...
Provable Bayesian Inference via Particle Mirror Descent
Since the prox-mapping of stochastic mirror descent is intractable when applied directly to the optimization problem (1), we propose the ε-inexact prox-mapping within the stochastic mirror descent framework in Section 3. Instead of solving the prox-mapping exactly, we approximate the solution with ε error. In this section, we will show that as long as the approximation error is tolerable, the stoc...
Journal:
- CoRR
Volume abs/1607.04614
Publication date: 2016